55 research outputs found

    Flow control for Latency-Critical RPCs

    Get PDF
    In modern datacenters, the waiting time spent in a server’s queue is a major contributor to the end-to-end tail latency of μs-scale remote procedure calls. In traditional TCP, congestion control handles in-network congestion, while flow control was designed to avoid memory overruns in streaming scenarios. The latter is unfortunately oblivious to the load on the server when processing short requests from multiple clients at very high rates. Recognizing flow control as the mechanism that controls queuing on the end host, we propose a different flow control mechanism that depends on application-specific service-level objectives and controls the waiting time in the receiver’s queue by adjusting the incoming load accordingly. We design this latency-aware flow control mechanism as part of TCP, maintaining a wire-compatible header format without introducing extra messages. We implement a proof-of-concept userspace TCP stack on top of DPDK and show that the new flow control mechanism prevents applications from violating service-level objectives in a single-server environment by throttling incoming requests. We demonstrate the full benefit of the approach in a replicated, multi-server scenario, where independent clients leverage the flow-control signal to avoid directing requests to the overloaded server.
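
    The abstract does not spell out the mechanism, so the following is a minimal sketch of one way a receiver could derive an SLO-aware advertised window; the function name, parameters, and formula are illustrative assumptions, not the paper's design.

        # Size the advertised window from the application's SLO instead of
        # from free buffer space alone: a request admitted now must wait
        # behind everything already queued, so only admit what can still
        # finish within the SLO.

        def advertised_window(slo_us: float, avg_service_us: float,
                              queued_requests: int, request_bytes: int) -> int:
            expected_wait_us = queued_requests * avg_service_us
            budget_us = max(slo_us - expected_wait_us, 0.0)
            admissible = int(budget_us / avg_service_us) if avg_service_us else 0
            return admissible * request_bytes  # in bytes, as TCP advertises it

        # 500us SLO, 50us average service time, 6 requests already queued:
        # 200us of budget remains, so admit 4 more 128-byte requests.
        print(advertised_window(500, 50, 6, 128))  # -> 512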

    Automated Debugging for Arbitrarily Long Executions

    Get PDF
    One of the most energy-draining and frustrating parts of software development is playing detective with elusive bugs. In this paper we argue that automated post-mortem debugging of failures is feasible for real, in-production systems with no runtime recording. We propose reverse execution synthesis (RES), a technique that takes a coredump obtained after a failure and automatically computes the suffix of an execution that leads to that coredump. RES provides a way to then play back this suffix in a debugger deterministically, over and over again. We argue that the RES approach could be used to (1) automatically classify bug reports based on their root cause, (2) automatically identify coredumps for which hardware errors (e.g., bad memory), not software bugs, are to blame, and (3) ultimately help developers reproduce the root cause of the failure in order to debug it.
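
    As a toy illustration of the "compute a suffix, then replay it" idea (not the actual RES algorithm, which targets x86 coredumps), consider a brute-force search over a two-register machine:

        from itertools import product

        # Three toy "instructions" over registers (a, b).
        OPS = {
            "inc_a": lambda a, b: (a + 1, b),
            "inc_b": lambda a, b: (a, b + 1),
            "add":   lambda a, b: (a + b, b),
        }

        def synthesize_suffix(start, coredump, max_len=4):
            """Find an instruction sequence taking `start` to `coredump`."""
            for n in range(max_len + 1):
                for seq in product(OPS, repeat=n):
                    state = start
                    for op in seq:
                        state = OPS[op](*state)
                    if state == coredump:
                        return list(seq)  # a deterministic, replayable suffix
            return None

        # The "coredump" says a=5, b=2; a candidate earlier state is (1, 2).
        print(synthesize_suffix((1, 2), (5, 2)))  # -> ['add', 'add']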

    How to Measure the Killer Microsecond

    Get PDF
    Datacenter-networking research requires tools to both generate traffic and accurately measure latency and throughput. While hardware-based tools have long existed commercially, they are primarily used to validate ASICs and lack flexibility, e.g., to study new protocols. They are also too expensive for academics. The recent development of kernel-bypass networking and advanced NIC features such as hardware timestamping have created new opportunities for accurate latency measurements. This paper compares these two approaches, and in particular asks whether commodity servers and NICs, when properly configured, can measure latency distributions as precisely as specialized hardware. Our work shows that well-designed commodity solutions can capture subtle differences in the tail latency of stateless UDP traffic. We use hardware devices as the ground truth, both to measure latency and to forward traffic. We compare the ground truth with observations that combine five latency-measuring clients and five different port-forwarding solutions and configurations. State-of-the-art software such as MoonGen that uses NIC hardware timestamping provides sufficient visibility into tail latencies to study the effect of subtle operating system configuration changes. We also observe that the kernel-bypass-based TRex software, which relies only on the CPU to timestamp traffic, can provide solid results when NIC timestamps are not available for a particular protocol or device.
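
    For reference, a minimal CPU-timestamped probe in the spirit of the software approach looks like the sketch below; the echo endpoint is an assumption, and real tools such as TRex pace traffic and calibrate clocks far more carefully.

        import socket, time

        TARGET = ("127.0.0.1", 7)   # assumed UDP echo endpoint
        SAMPLES = 1000

        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.settimeout(1.0)

        rtts_us = []
        for seq in range(SAMPLES):
            payload = seq.to_bytes(4, "big")
            t0 = time.perf_counter_ns()      # CPU timestamp on send
            sock.sendto(payload, TARGET)
            data, _ = sock.recvfrom(64)
            t1 = time.perf_counter_ns()      # CPU timestamp on receive
            if data[:4] == payload:
                rtts_us.append((t1 - t0) / 1000)

        # Tail percentiles, not the mean, are what matters here. These CPU
        # timestamps include OS and PCIe noise that NIC hardware
        # timestamping would exclude.
        rtts_us.sort()
        for p in (50, 99, 99.9):
            print(f"p{p}: {rtts_us[int(len(rtts_us) * p / 100) - 1]:.1f} us")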

    Measuring Latency: Am I doing it right?

    Get PDF
    This poster describes the basic methodology for conducting an accurate latency experiment; one key element of such a methodology is sketched below.
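
    One widely accepted element of accurate latency measurement is open-loop load generation, which avoids coordinated omission: requests are issued on a fixed schedule and latency is measured from each request's intended send time, so a slow response cannot delay later sends and hide queueing. A minimal sketch, with a dummy request standing in for real work:

        import time

        def open_loop_latencies(request_fn, rate_hz: float, duration_s: float):
            """Call request_fn on a fixed schedule; return latencies in us,
            measured from each request's *intended* send time."""
            interval_ns = int(1e9 / rate_hz)
            start = time.perf_counter_ns()
            latencies_us = []
            for i in range(int(duration_s * rate_hz)):
                intended = start + i * interval_ns
                while time.perf_counter_ns() < intended:
                    pass                 # spin to hit the schedule precisely
                request_fn()
                latencies_us.append((time.perf_counter_ns() - intended) / 1000)
            return latencies_us

        # Dummy 100us "request" issued at 1000 requests per second.
        lat = open_loop_latencies(lambda: time.sleep(0.0001), 1000, 0.5)
        print(f"max observed: {max(lat):.0f} us over {len(lat)} samples")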

    Benchmarking, Analysis, and Optimization of Serverless Function Snapshots

    Get PDF
    Serverless computing has seen rapid adoption due to its high scalability and flexible, pay-as-you-go billing model. In serverless, developers structure their services as a collection of functions, sporadically invoked by various events like clicks. High inter-arrival-time variability of function invocations motivates providers to start new function instances upon each invocation, leading to significant cold-start delays that degrade user experience. To reduce cold-start latency, the industry has turned to snapshotting, whereby an image of a fully booted function is stored on disk, enabling a faster invocation compared to booting a function from scratch. This work introduces vHive, an open-source framework for serverless experimentation with the goal of enabling researchers to study and innovate across the entire serverless stack. Using vHive, we characterize a state-of-the-art snapshot-based serverless infrastructure based on the industry-leading Containerd orchestration framework and the Firecracker hypervisor. We find that the execution time of a function started from a snapshot is 95% higher, on average, than when the same function is memory-resident. We show that the high latency is attributable to frequent page faults as the function’s state is brought from disk into guest memory one page at a time. Our analysis further reveals that functions access the same stable working set of pages across different invocations of the same function. By leveraging this insight, we build REAP, a lightweight software mechanism for serverless hosts that records functions’ stable working set of guest-memory pages and proactively prefetches it from disk into memory. Compared to baseline snapshotting, REAP slashes cold-start delays by 3.7x, on average. To appear in ASPLOS 2021.
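
    A toy sketch of REAP's record-and-prefetch idea follows; the page size, file name, and data structures are illustrative, whereas REAP itself operates on guest-memory pages under Firecracker.

        import os

        PAGE = 4096
        SNAPSHOT = "snapshot.img"    # stand-in for the on-disk snapshot

        def record(accessed_offsets):
            """Record phase: turn the byte offsets a function touched into
            its stable working set of page numbers."""
            return sorted({off // PAGE for off in accessed_offsets})

        def prefetch(working_set):
            """Replay phase: bulk-read exactly those pages before the
            function runs, instead of demand-faulting one page at a time."""
            pages = {}
            with open(SNAPSHOT, "rb") as f:
                for page_no in working_set:  # sorted, so I/O is near-sequential
                    f.seek(page_no * PAGE)
                    pages[page_no] = f.read(PAGE)
            return pages

        # The record phase saw accesses at these byte offsets:
        ws = record([0, 100, 4097, 1 << 20])
        print(ws)                            # -> [0, 1, 256]

        # Dummy 2 MiB snapshot so the sketch runs end to end.
        with open(SNAPSHOT, "wb") as f:
            f.write(os.urandom(2 << 20))
        print(sum(len(p) for p in prefetch(ws).values()), "bytes prefetched")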

    Lightweight Snapshots and System-level Backtracking

    Get PDF
    We propose a new system-level abstraction, the lightweight immutable execution snapshot, which combines the immutable characteristics of checkpoints with direct integration into the virtual memory subsystem of standard mutable address spaces. The abstraction can give arbitrary x86 programs and libraries system-level support for backtracking (akin to logic programming) and the ability to manipulate an entire address space as an immutable data structure (akin to functional programming). Our proposed implementation leverages modern x86 hardware-virtualization support.
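
    The closest userspace analogue is fork(), whose copy-on-write semantics give the child a snapshot of the whole address space; the paper's abstraction uses x86 hardware virtualization instead, but the backtracking control flow is similar. A sketch (POSIX-only, Python 3.9+):

        import os

        state = {"x": 0}

        def try_branch(mutation, check):
            """Run mutation(state) in a forked snapshot; report whether it
            passed `check` there. The parent's state is untouched either way."""
            pid = os.fork()
            if pid == 0:                     # child: private COW snapshot
                mutation(state)
                os._exit(0 if check(state) else 1)
            _, status = os.waitpid(pid, 0)   # parent: collect the verdict
            return os.waitstatus_to_exitcode(status) == 0

        # Backtracking search: find a value of x whose square is 49.
        for candidate in (3, 5, 7):
            if try_branch(lambda s, v=candidate: s.update(x=v),
                          lambda s: s["x"] ** 2 == 49):
                state["x"] = candidate       # commit only the winning branch
                break
        print(state)  # -> {'x': 7}; failed branches never touched this dict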

    Design Guidelines for High-Performance SCM Hierarchies

    Full text link
    With emerging storage-class memory (SCM) nearing commercialization, there is evidence that it will deliver the much-anticipated high density and access latencies within only a few factors of DRAM. Nevertheless, the latency-sensitive nature of memory-resident services makes seamless integration of SCM in servers questionable. In this paper, we ask how best to introduce SCM for such servers to improve overall performance/cost over existing DRAM-only architectures. We first show that even with the most optimistic latency projections for SCM, the higher memory access latency results in prohibitive performance degradation. However, we find that deployment of a modestly sized high-bandwidth 3D stacked DRAM cache makes the performance of an SCM-mostly memory system competitive. The high degree of spatial locality that memory-resident services exhibit not only simplifies the DRAM cache’s design as page-based, but also enables the amortization of increased SCM access latencies and the mitigation of SCM’s read/write latency disparity. We identify the set of memory hierarchy design parameters that play a key role in the performance and cost of a memory system combining an SCM technology and a 3D stacked DRAM cache. We then introduce a methodology to drive provisioning for each of these design parameters under a target performance/cost goal. Finally, we use our methodology to derive concrete results for specific SCM technologies. With PCM as a case study, we show that a two-bit-per-cell technology hits the performance/cost sweet spot, reducing the memory subsystem cost by 40% while keeping performance within 3% of the best-performing DRAM-only system, whereas single-level and triple-level cell organizations are impractical for use as memory replacements. Published at MEMSYS’18.
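
    The flavor of the provisioning methodology can be conveyed with a back-of-the-envelope model; every latency, price, and hit rate below is a made-up placeholder, not a number from the paper.

        def amat_ns(hit_rate, dram_ns, scm_read_ns):
            """Average memory access time with a DRAM cache in front of SCM."""
            return hit_rate * dram_ns + (1 - hit_rate) * scm_read_ns

        def cost(dram_gb, scm_gb, dram_per_gb, scm_per_gb):
            return dram_gb * dram_per_gb + scm_gb * scm_per_gb

        DRAM_NS, DRAM_PER_GB = 80, 8.0
        configs = {
            # name: (DRAM GB, SCM GB, SCM read ns, SCM $/GB, cache hit rate)
            "DRAM-only":       (128,   0,   0, 0.0, 1.00),
            "SLC PCM + cache": ( 16, 112, 250, 4.0, 0.95),
            "2b/cell + cache": ( 16, 112, 400, 2.0, 0.95),
        }

        for name, (d_gb, s_gb, s_ns, s_cost, hit) in configs.items():
            print(f"{name:16s} AMAT={amat_ns(hit, DRAM_NS, s_ns):5.1f} ns"
                  f"  cost=${cost(d_gb, s_gb, DRAM_PER_GB, s_cost):7.1f}")

    Under such a model, a denser (cheaper) cell technology trades a few percent of average access time for a large cost reduction, which is exactly the trade-off the paper's methodology quantifies with real workloads and technology parameters.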

    Expedited Data Transfers for Serverless Clouds

    Full text link
    Serverless computing has emerged as a popular cloud deployment paradigm. In serverless, developers implement their application as a set of chained functions that form a workflow in which functions invoke each other. The cloud providers are responsible for automatically scaling the number of instances of each function on demand and forwarding the requests in a workflow to the appropriate function instance. Problematically, today’s serverless clouds lack efficient support for cross-function data transfers in a workflow, preventing the efficient execution of data-intensive serverless applications. In production clouds, functions transmit intermediate, i.e., ephemeral, data to other functions either as part of invocation HTTP requests (i.e., inline) or via third-party services, such as AWS S3 storage or the AWS ElastiCache in-memory cache. The former approach is restricted to small transfer sizes, while the latter supports arbitrary transfers but suffers from performance and cost overheads. This work introduces Expedited Data Transfers (XDT), an API-preserving high-performance data communication method for serverless that enables direct function-to-function transfers. With XDT, a trusted component of the sender function buffers the payload in its memory and sends a secure reference to the receiver, which is picked by the load balancer and autoscaler based on the current load. Using the reference, the receiver instance pulls the transmitted data directly from the sender’s memory. XDT is natively compatible with existing autoscaling infrastructure, preserves function invocation semantics, is secure, and avoids the cost and performance overheads of using an intermediate service for data transfers. We prototype our system in vHive/Knative deployed on a cluster of AWS EC2 nodes, showing that XDT improves latency, bandwidth, and cost over AWS S3 and ElastiCache.
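
    A minimal in-process sketch of the pass-by-reference idea follows; the real system performs the pull over the network from a trusted component co-located with the sender, and the class and field names here are illustrative.

        import secrets

        class SenderBuffer:
            """Trusted sender-side component: buffers payloads and hands out
            small, unguessable, single-use references."""
            def __init__(self):
                self._payloads = {}

            def put(self, data: bytes) -> str:
                ref = secrets.token_hex(16)
                self._payloads[ref] = data
                return ref

            def pull(self, ref: str) -> bytes:
                return self._payloads.pop(ref)  # consumed on first pull

        buf = SenderBuffer()

        # Producer: instead of inlining 10 MB into the invocation request,
        # send a 32-character reference.
        payload = b"x" * (10 * 1024 * 1024)
        invocation = {"args": {"blob_ref": buf.put(payload)}}

        # Consumer instance (whichever one the autoscaler picked) pulls the
        # bytes directly using the reference.
        assert buf.pull(invocation["args"]["blob_ref"]) == payload
        print("transferred", len(payload), "bytes via a small reference")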